Abstract. In this article, a hybrid parallel bidirectional sieve is implemented
on an SMP cluster, in which the individual computational units, joined together
by a communication network, are usually shared-memory systems with one or more
multicore processors. For high-efficiency optimization, we propose dividing the
data evenly among the nodes and generating a double-ended queue (deque) on each
node, so that a dual-core processor can start sifting out primes from the head
and the tail simultaneously. Each node also creates a FIFO queue as a dynamic
data buffer to cache temporary data sent from other nodes. The approach obtains
a large speedup and high efficiency on an SMP cluster.

1 Introduction

Research into questions involving primes continues today, partly driven by the
importance of primes in modern cryptography. As computational power increases,
researchers often pay more attention to data analysis, climate modeling,
protein folding, drug discovery, and so on. We can also exploit multicore
processors to efficiently solve problems in the field of number theory.

M. Aigner and G. M. Ziegler [1] presented six quite different proofs of the
infinitude of primes. Mills [2] has shown that there is a constant $\theta$
such that the function $f(n) = \lfloor \theta^{3^n} \rfloor$ generates only
primes. The sieve of Eratosthenes-Legendre [3] [4] is an ancient algorithm for
finding all prime numbers up to any given limit. In number theory, tests that
distinguish between primes and composite integers are crucial. The most basic
primality test is trial division, which tells us that an integer n > 1 is prime
if and only if it is not divisible by any prime not exceeding $\sqrt{n}$.
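As a minimal sketch of trial division (the function name `is_prime_trial_division` is an illustrative assumption, not taken from the paper), the test can be written as follows. For simplicity it tries every odd candidate up to $\sqrt{n}$ rather than only the primes, which gives the same answer.

```python
import math


def is_prime_trial_division(n: int) -> bool:
    """Return True if n is prime, testing divisors up to sqrt(n)."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    # Any composite n has a divisor not exceeding floor(sqrt(n)),
    # so only odd candidates in that range need to be checked.
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


print([n for n in range(30) if is_prime_trial_division(n)])
```

The loop runs about $\sqrt{n}/2$ times in the worst case, which is why the cost grows exponentially in the number of binary digits of n.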

The computational complexity of algorithms for determining whether an integer
is prime is measured in terms of the number of binary digits in the integer.
The algorithm using trial divisions to determine whether an integer n is prime
is exponential in terms of the number of binary digits of n, or in terms of
$\log_2 n$, because $\sqrt{n} = 2^{(\log_2 n)/2}$.

As n gets large, an algorithm with exponential complexity quickly becomes
impractical. Leonard Adleman, Carl Pomerance, and Robert Rumely [5] [6]
developed an algorithm that can prove an integer is prime using
$(\log n)^{c \log\log\log n}$ bit operations, where c is a constant. In 2002,
M. Agrawal, N. Kayal, and N. Saxena [7] announced that they had found an
algorithm, presented in "PRIMES is in P", that can produce a certificate of
primality for an integer n using $O((\log n)^{12})$ bit operations.

Karl Friedrich Gauss conjectured that $\pi(x)$ increases at the same rate as
the functions $x/\log x$ and $\mathrm{Li}(x) = \int_2^x \frac{dt}{\log t}$. The
Prime Number Theorem states that the ratio of $\pi(x)$ to $x/\log x$ approaches
1 as x grows without bound. One way [11] to evaluate $\pi(x)$ using only
$O(x^{3/5+\epsilon})$ bit operations, without finding all the primes less than
x, is to use a counting argument based on the sieve of Eratosthenes.
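To make the Prime Number Theorem concrete, the sketch below counts primes with a plain sieve of Eratosthenes and compares $\pi(x)$ with $x/\log x$. This is the simple approach that does enumerate every prime, not the faster counting argument cited above; the names `sieve_of_eratosthenes` and `prime_count` are illustrative assumptions.

```python
import math


def sieve_of_eratosthenes(limit: int) -> list:
    """Return all primes <= limit by marking multiples of each prime."""
    if limit < 2:
        return []
    is_composite = bytearray(limit + 1)
    primes = []
    for i in range(2, limit + 1):
        if not is_composite[i]:
            primes.append(i)
            # Multiples below i*i were already marked by smaller primes.
            for multiple in range(i * i, limit + 1, i):
                is_composite[multiple] = 1
    return primes


def prime_count(x: int) -> int:
    """pi(x): the number of primes not exceeding x."""
    return len(sieve_of_eratosthenes(x))


for x in (10**3, 10**4, 10**5):
    pi_x = prime_count(x)
    # The ratio pi(x) / (x / ln x) slowly approaches 1 as x grows.
    print(x, pi_x, round(pi_x / (x / math.log(x)), 3))
```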

In this paper, a hybrid parallel bidirectional sieve based on an SMP cluster is
proposed to improve efficiency and speedup. The approach is shown to be
effective using MPI and OpenMP [8] [9] [10]. Hybrid parallelism of this kind
has far-reaching significance in cryptography.



2 Communication and Optimization 

ILP and TLP provide parallelism at a very low level; they are typically
controlled by the processor and the operating system, and are not directly
controlled by the programmer. Parallel hardware is often classified using
Flynn's taxonomy, which distinguishes between the number of instruction streams
and the number of data streams a system can handle. A von Neumann system is
classified as SISD. Vector processors and graphics processing units (GPUs) are
often classified as SIMD. MIMD systems execute multiple independent instruction
streams, each of which can have its own data stream. Shared-memory and
distributed-memory systems are typically MIMD. Most of the larger MIMD systems
are hybrid systems (Fig. 1) in which a number of relatively small shared-memory
systems are connected by an interconnection network. In such systems, the
individual shared-memory systems are sometimes called nodes.



[Figure: SMP nodes (up to SMP node n-1), each with processors and memory,
connected by an interconnection network.]

Fig. 1. SMP Cluster Architecture.

2.1 Interconnection networks

Currently the two most widely used interconnects on shared-memory systems are
buses and crossbars [15]. The key characteristic of a bus is that the
communication wires are shared by the devices that are connected to it. Buses
have the virtue of low cost and flexibility. Crossbars (Fig. 2) allow
simultaneous communication among different devices, so they are much faster
than buses, but the cost of the switches and links is relatively high.

Distributed-memory interconnects are often divided into two groups: direct
interconnects and indirect interconnects. One measure of "number of
simultaneous communications" or "connectivity" is bisection width. To
understand this measure, imagine that the parallel system is divided into two
halves, and each half contains half of the processors or nodes; the bisection
width is the number of simultaneous communications that can take place across
the divide in the worst case. An alternate way of computing the bisection width
is to remove the minimum number of links needed to split the set of nodes into
two equal halves.




Fig. 2. Shared-memory system simultaneous memory access 



The hypercube (Fig. 3) is a highly connected direct interconnect that has been
used in actual systems. A hypercube of dimension d has $p = 2^d$ nodes, and a
switch in a d-dimensional hypercube is directly connected to a processor and d
switches. The bisection width of a hypercube is $p/2$. The switches support
$1 + d = 1 + \log_2 p$ wires. The hypercube is more powerful, but also more
expensive to construct.
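The hypercube structure can be sketched in a few lines, assuming the usual convention (not stated in the paper) that nodes are labelled with d-bit integers and two nodes are linked when their labels differ in exactly one bit. The function names are illustrative. The second function counts the links that cross one particular balanced cut, the split on the highest address bit, which yields the $p/2$ value quoted above.

```python
def hypercube_neighbors(node: int, d: int) -> list:
    """Neighbors of a node in a d-dimensional hypercube: flip one bit each."""
    return [node ^ (1 << bit) for bit in range(d)]


def links_across_top_bit_cut(d: int) -> int:
    """Count links joining the halves defined by the highest address bit (d >= 1)."""
    p = 1 << d          # number of nodes, p = 2^d
    top = 1 << (d - 1)  # mask of the highest address bit
    crossing = 0
    for node in range(p):
        for neighbor in hypercube_neighbors(node, d):
            # Count each undirected link once, and only if it crosses the cut.
            if node < neighbor and (node & top) != (neighbor & top):
                crossing += 1
    return crossing


print(hypercube_neighbors(0, 3))    # node 000 is linked to 001, 010, 100
print(links_across_top_bit_cut(3))  # p / 2 for p = 8
```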

The crossbar and the omega network are relatively simple examples of indirect
networks. The omega network (Fig. 4) is less expensive than the crossbar. The
omega network uses $\frac{1}{2} p \log_2(p)$ of the 2x2 crossbar switches, so
it uses a total of $2p \log_2(p)$ switches, while the crossbar uses $p^2$.

Fig. 3. (a) two-dimensional hypercube (b) three-dimensional hypercube (c)
four-dimensional hypercube

2.2 Hybrid Parallelism 

We define the speedup of a parallel program to be
$S = T_{serial} / T_{parallel}$. Linear speedup then has $S = p$, where p is
the number of cores. The value $S/p$ is sometimes called the efficiency of the
parallel program:

$$E = \frac{S}{p} = \frac{T_{serial}}{p \cdot T_{parallel}} \qquad (1)$$

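Back in the 1960s, Gene Amdahl [13] made an observation that has become known
as Amdahl's Law:

$$S_{overall} = \frac{1}{(1 - f) + \frac{f}{p}} \qquad (2)$$

where f is the fraction of the program that can be parallelized and p is the
number of cores.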

It means that unless virtually all of a serial program is parallelized, the
possible speedup is going to be very limited, regardless of the number of cores
available. A more mathematical version of this statement is known as
Gustafson's Law [13].
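A short numerical sketch of these quantities follows. It assumes the standard form of Amdahl's Law, $S = 1/((1 - f) + f/p)$, with f the parallelizable fraction and p the number of cores; the function names and the sample timings are illustrative assumptions, not measurements from this paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """E = S / p = T_serial / (p * T_parallel)."""
    return speedup(t_serial, t_parallel) / p


def amdahl_speedup(f: float, p: int) -> float:
    """Upper bound on speedup when a fraction f of the work is parallelized."""
    return 1.0 / ((1.0 - f) + f / p)


# Hypothetical timings: 20 s serial, 4 s on 8 cores.
print(speedup(20.0, 4.0), efficiency(20.0, 4.0, 8))

# With f = 0.9 the bound never exceeds 1 / (1 - f) = 10, however many cores.
for cores in (2, 10, 100, 10**6):
    print(cores, round(amdahl_speedup(0.9, cores), 3))
```

Even with a million cores the bound stays just under 10, which is the sense in which the serial fraction limits the achievable speedup.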

Unfortunately, there are several mismatch problems between the (hybrid)
programming schemes and the hybrid hardware architecture. Publications often
show that applications may or may not benefit from hybrid programming depending
on some application parameters, e.g., in [16][17][18][19].

Rolf Rabenseifner analyses strategies to overcome typical drawbacks of this
easily usable programming scheme on systems with weaker interconnects ^U\ .
Best performance can be achieved by overlapping communication and computation,
but this scheme is lacking in ease of use. Often, hybrid MPI + OpenMP
programming denotes a programming style with OpenMP shared-memory
parallelization inside the MPI processes (i.e., each MPI process itself has
several OpenMP threads) and communication with MPI between the MPI processes,
but only outside of parallel regions.

This hybrid programming scheme is named masteronly in the following
classification, which is based on the question of when and by which thread(s)
the messages are sent between the MPI processes:



. Pure MPI

. Hybrid MPI + OpenMP

. Overlapping communication and computation

. Pure OpenMP

Overlapping of communication and computation is a chance for an optimal usage
of the hardware, but it demands effort in the application itself, in the OpenMP
parallelization, and in the load balancing. It requires a coarse-grained and
thread-rank-based OpenMP parallelization, the separation of halo-based
computation from the computation that can be overlapped with communication, and
load balancing among the threads with different tasks. Advantages of the
overlapping scheme are:

. the problem that one CPU may not achieve the inter-node bandwidth is no 
longer relevant as long as there is enough computational work that can be 
overlapped with the communication 

. the saturation problem is solved as long as no more CPUs communicate in
parallel than are necessary to achieve the inter-node bandwidth

. the sleeping threads problem is solved as long as all computation and com- 
munication is load balanced among the threads. 
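The structure of the overlapping scheme can be illustrated with a small single-process analogy. This is only a sketch under loud assumptions: it uses Python threads and a `queue.Queue` as the FIFO buffer instead of MPI and OpenMP, the function names are hypothetical, and because of Python's global interpreter lock it shows the control flow rather than a real speedup. One thread stands in for communication and delivers remote chunks, while the main thread first does the work that needs no remote data.

```python
import queue
import threading


def communicate(remote_chunks, inbox):
    """Stand-in for the communicating thread: deliver chunks, then a sentinel."""
    for chunk in remote_chunks:
        inbox.put(chunk)
    inbox.put(None)  # signals that no more data will arrive


def overlapped_sum(local_data, remote_chunks):
    """Sum local and remote data, overlapping 'communication' with computation."""
    inbox = queue.Queue()  # FIFO buffer for data arriving from other nodes
    comm = threading.Thread(target=communicate, args=(remote_chunks, inbox))
    comm.start()

    # Computation that does not depend on remote data runs while
    # the communication thread is still delivering chunks.
    total = sum(local_data)

    # The remote-dependent part is processed as the data arrives.
    while True:
        chunk = inbox.get()
        if chunk is None:
            break
        total += sum(chunk)

    comm.join()
    return total


print(overlapped_sum([1, 2, 3], [[4, 5], [6]]))
```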




3 Bidirectional Sieve Model 

Foster's methodology [12] provides an outline of steps for parallel
programming, which include:

. Partitioning

. Communication

. Agglomeration or aggregation

. Mapping

3.1 Algorithm Design 

The sieve of Eratosthenes finds the primes by iteratively marking as composite
the multiples of each prime, starting with the multiples of 2 [3]. We can
exploit and improve the sieve of Eratosthenes on an SMP cluster (Fig. 5).
Assume that there are n unordered integers and that each node sieves a block of
k integers, which allows high-efficiency optimization. We conjecture that the
SMP cluster then requires at least N nodes, where N is given by:

$$N = \frac{n}{k} + (n \bmod k) \,\&\, 1 \qquad (3)$$

Each node generates one deque and processes it with two cores. One core starts
at the head of the deque, and the other starts at the tail. It is then easy to
deduce the relation between the number of cores ($C_{cores}$) and the number of
deques ($D_{deques}$):

$$C_{cores} = 2 D_{deques} = 2N \qquad (4)$$
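The head-and-tail idea can be sketched on a single node as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: two Python threads stand in for the two cores, a simple trial-division test stands in for the primality check, and the names `partition`, `is_prime`, and `bidirectional_sieve` are hypothetical. It relies on `collections.deque` supporting thread-safe pops from either end, so each integer is taken by exactly one worker.

```python
import threading
from collections import deque


def is_prime(n: int) -> bool:
    """Simple trial-division check, used here only as a placeholder test."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def partition(data, k):
    """Split the data into consecutive blocks of at most k integers, one per node."""
    return [data[i:i + k] for i in range(0, len(data), k)]


def bidirectional_sieve(block):
    """Sift primes out of one block with two workers, from the head and the tail."""
    dq = deque(block)
    found_head, found_tail = [], []

    def worker(take, found):
        while True:
            try:
                x = take()  # popleft for the head worker, pop for the tail worker
            except IndexError:  # the deque is empty: both ends have met
                return
            if is_prime(x):
                found.append(x)

    head = threading.Thread(target=worker, args=(dq.popleft, found_head))
    tail = threading.Thread(target=worker, args=(dq.pop, found_tail))
    head.start()
    tail.start()
    head.join()
    tail.join()
    return sorted(found_head + found_tail)


blocks = partition([10, 7, 4, 13, 9, 2, 15, 11], 4)
print([bidirectional_sieve(block) for block in blocks])
```

The results are sorted at the end because the order in which the two workers finish individual integers is not deterministic.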

There is another point worth considering. In most cases, the block held by
node N is not exactly of size k. This case is handled as shown in Algorithm 1.



Algorithm 1 The block size of node N

Require: K denotes the current block size of node N
Ensure: k denotes the general block size of a node

  if 0 < K < k/2 then
    Node N assigns a single core to sieve from the right or the left
  else
    Node N assigns both cores to a simultaneous bidirectional sieve
  end if



Its flow diagram is shown in Fig. 6.

3.2 Primality Testing : Non-deterministic 

Primality testing is perhaps the most common problem in number theory. The
problem of detecting whether a given number is prime has been studied
extensively; nonetheless, it turns out that all the deterministic algorithms
for this problem are too slow to be used in real-life situations, and the
better ones among them are tedious to code. There are, however, some
probabilistic methods that are very fast and very easy to code. Moreover, the
probability of getting a wrong result with these algorithms is so low that it
can be neglected in normal situations.

All the algorithms discussed here require efficiently computing $a^b \bmod c$
(where a, b, c are non-negative integers). A straightforward algorithm is to
iteratively multiply the result by a and take the remainder modulo c at each
step; this takes $O(b)$ time and is not very useful in practice. We can do it
in $O(\log b)$ time by using exponentiation by squaring, as follows:

$$a^b = \begin{cases} (a^2)^{b/2}, & \text{if } b \text{ is even and } b > 0 \\ a\,(a^2)^{(b-1)/2}, & \text{if } b \text{ is odd} \\ 1, & \text{if } b = 0 \end{cases}$$
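The three cases of exponentiation by squaring translate directly into a recursive function. The sketch below is illustrative (the name `mod_pow` is an assumption, and c is assumed to be at least 1); it reduces modulo c at every step so that intermediate values stay small.

```python
def mod_pow(a: int, b: int, c: int) -> int:
    """Compute (a ** b) % c by exponentiation by squaring, for b >= 0 and c >= 1."""
    if b == 0:
        return 1 % c  # a^0 = 1, and 1 % c handles the case c == 1
    # (a^2)^(b // 2), with b // 2 equal to (b - 1) / 2 when b is odd.
    half = mod_pow(a * a % c, b // 2, c)
    if b % 2 == 0:
        return half
    return a * half % c


print(mod_pow(2, 10, 1000))
print(mod_pow(3, 13, 7) == pow(3, 13, 7))
```

The recursion depth is about $\log_2 b$, matching the $O(\log b)$ bound; Python's built-in three-argument `pow` computes the same value and is used here only as a cross-check.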

Pierre de Fermat first stated Fermat's Little Theorem in a letter dated
October 18, 1640, to his friend and confidant Frenicle de Bessy, as follows:

$$a^p \equiv a \pmod{p} \qquad (5)$$

or alternatively:

$$a^{p-1} \equiv 1 \pmod{p} \qquad (6)$$

According to Fermat's Little Theorem [7], if p is a prime number and a is a
positive integer less than p (a < p), then $a^{p-1} \bmod p = 1$. To test p, we
therefore choose such an a and calculate $a^{p-1} \bmod p$. If the result is
not 1, then by Fermat's Little Theorem p cannot be prime. The more iterations
we do, the higher the probability that our result is correct.



[Figure: flow diagram with the stages Data Analysis; Divide and Conquer;
N = n/k + (n % k) & 1; N dual-cores in SMP Cluster; N deque and N FIFO queue;
Get La and Rb from queue; Computing and Communication overlap.]

Fig. 6. High-level flow diagram of hybrid parallel bidirectional Sieve

Algorithm 2 modulo(a, b, c): exponentiation by squaring to compute (a^b) mod c
Require: x = 1, y = a
Ensure: (a^b) mod c
  while b > 0 do
    if b & 1 then
      x = (x * y) mod c
    end if
    y = (y * y) mod c
    b >>= 1
  end while
  return x mod c



Algorithm 3 Fermat(p, iterations): Fermat's primality test
  if p = 1 then
    return false
  end if
  for i := 1 to iterations do
    a = rand() mod (p - 1) + 1
    if modulo(a, p - 1, p) != 1 (Alg. 2) then
      return false
    end if
  end for
  return true



Though the Fermat test is highly accurate in practice, there are certain
composite numbers p, known as Carmichael numbers, for which
$a^{p-1} \bmod p = 1$ holds for all values of a < p with $\gcd(a, p) = 1$. In
that case, the Fermat test returns a wrong result with very high probability.
Out of the Carmichael numbers less than $10^{16}$, about 95% are divisible by
primes < 1000. However, there are improved primality tests that do not have
this flaw (e.g., the Rabin-Miller test [21][22] and the Solovay-Strassen test
[23]).
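The behaviour on Carmichael numbers can be checked with a short sketch of the Fermat test. The function names and the optional `rng` parameter are illustrative assumptions; bases are drawn from 1 to p - 1, and Python's built-in three-argument `pow` plays the role of the modular exponentiation routine. The second function checks the Carmichael property directly for 561 = 3 x 11 x 17, the smallest Carmichael number.

```python
import math
import random


def fermat_test(p: int, iterations: int = 20, rng=None) -> bool:
    """Probabilistic Fermat test: False means composite, True means probably prime."""
    rng = rng or random.Random()
    if p < 2:
        return False
    for _ in range(iterations):
        a = rng.randrange(1, p)  # random base in 1 .. p-1
        if pow(a, p - 1, p) != 1:
            return False  # a witnesses that p is composite
    return True


def all_coprime_bases_pass(n: int) -> bool:
    """True if a^(n-1) = 1 (mod n) for every base a < n with gcd(a, n) = 1."""
    return all(pow(a, n - 1, n) == 1
               for a in range(1, n) if math.gcd(a, n) == 1)


print(fermat_test(97))              # a prime always passes
print(all_coprime_bases_pass(561))  # Carmichael number: every coprime base passes
print(all_coprime_bases_pass(15))   # ordinary composite: some coprime base fails
```

For 561 the test can only fail when the random base happens to share a factor with 561, which is why a Carmichael number is reported as prime with high probability.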

4 Performance Analysis 

Different programming schemes on clusters of SMPs show different performance
benefits or penalties. Fig. 7 summarizes the results of the hybrid parallel
bidirectional sieve. When the data scale is tiny, communication between nodes
takes most of the time, and the hybrid approach is even slower than the general
method. However, at very large data scales, hybrid parallelism shows a large
gain in efficiency. Indeed, the communication cost can sometimes be neglected.
In that case, multicore parallelism is an effective approach to solving some
problems in number theory.

To achieve optimal usage of the hardware, one can also try to use the idle CPUs
for other applications, especially low-priority single-threaded or
multi-threaded non-MPI applications, if the parallel high-priority hybrid
application does not use the total memory of the SMP nodes.

[Figure: plot with horizontal axis "data scale".]

Fig. 7. Statistics and analysis of the hybrid parallel bidirectional sieve
compared with the general method

5 Conclusion 

In this study we have shown that hybrid parallelism on an SMP cluster is an
applicable method for implementing a bidirectional sieve. The analysis
demonstrated that the hybrid parallel bidirectional sieve is an efficient and
optimized solution.

As computational power increases, most HPC systems are clusters of
shared-memory nodes. Parallel programming must combine distributed-memory
parallelization on the node interconnect with shared-memory parallelization
inside each node. Each parallel programming scheme on a hybrid architecture has
one or more significant drawbacks (e.g., the sleeping-thread and saturation
problems). However, hybrid parallelism also has far-reaching significance in
many fields (e.g., cryptography, data analysis, climate modeling, protein
folding, and drug discovery).

We believe that the hybrid parallel bidirectional sieve can be properly modeled
using techniques from number theory, and this article is just an early trial of
using hybrid parallelism to improve speedup and efficiency.